Analysis Airbnb

import libraries

Collect Data

AirBnb Data

As of Jan 1, 2022, 11 AirBnb datasets are available at http://insideairbnb.com/get-the-data.html. They were scraped once per month between Dec 2020 and Nov 2021, with May 2021 skipped. The 11 datasets were downloaded and stored in 11 folders.
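As a quick sanity check on the count above, the snapshot months can be enumerated in plain Python. This is only a sketch: the function name `snapshot_months` and the `(year, month)` tuple representation are illustrative choices, not part of the original notebook.

```python
def snapshot_months(start=(2020, 12), end=(2021, 11), skipped=((2021, 5),)):
    """Return (year, month) pairs from start to end inclusive, minus skipped months."""
    months = []
    year, month = start
    while (year, month) <= end:
        if (year, month) not in skipped:
            months.append((year, month))
        # Advance by one calendar month, rolling over into the next year.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


months = snapshot_months()
print(len(months))  # 11
```

Dec 2020 through Nov 2021 spans 12 calendar months, so skipping May 2021 leaves the 11 snapshots described above.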

Reference Data

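We use two reference datasets: confirmed Covid cases in New South Wales (NSW), and Google Trends data for the query term "Covid" from users in Australia.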

We use confirmed-case data for NSW rather than for Australia as a whole, on the assumption that the Covid situation in NSW has a stronger impact on NSW AirBnb business than the Covid situation outside NSW.

We use Google Trends data for queries made by users across Australia rather than in NSW alone, on the assumption that NSW AirBnb guests usually come from all over Australia rather than only from NSW. The views on Covid held by people across Australia may influence whether they book a NSW AirBnb property.

The numbers in the Google Trends data indicate how users' "interest" in the term "Covid" changed during the given period. Roughly speaking, the highest search frequency at any timepoint within the period is normalized to 100, and the frequencies at the remaining timepoints are normalized to numbers <= 100 by comparing them against that highest frequency.
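The rescaling described above can be illustrated with a short sketch. The function name `normalize_to_peak` and the raw frequencies are made up for illustration. The code shows only the "scale the peak to 100" step and makes no claim about how Google computes its published values.

```python
def normalize_to_peak(frequencies):
    """Rescale a series so its largest value becomes 100 and the rest are <= 100."""
    peak = max(frequencies)
    if peak == 0:
        # No searches at all: every point stays at 0.
        return [0 for _ in frequencies]
    return [round(100 * f / peak) for f in frequencies]


# Made-up raw search frequencies for four timepoints.
raw = [120, 300, 90, 600]
print(normalize_to_peak(raw))  # [20, 50, 15, 100]
```

Because every value is expressed relative to the peak within the chosen period, the numbers are comparable only inside that period, not across separately downloaded periods.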

1. AirBnb Operating Property Number and Occupancy

In this use case we estimate the number of AirBnb properties in operation and their occupancy rate in the period between Dec 2020 and Nov 2021. The data were collected in 11 scrapings, one per month, with May 2021 skipped. Each scraping usually lasted 1-2 days.

We treat the status of properties at the time they were scraped as a snapshot, and therefore as a sample of their status on a typical day. We estimate the number of properties in operation by counting the properties listed at scraping time, and we estimate whether each property was occupied or vacant by checking whether it was available for booking at scraping time.

The second estimate rests on the simplistic assumption that "available" means "vacant" and "unavailable" means "occupied". In reality, a property can be unavailable for two reasons: it is booked or occupied by guests, or its host has blocked booking. We assume the second reason for unavailability is uncommon and can be safely ignored in the analysis.

We observe changes in property number and occupancy rate in the dataset. To investigate the causes of these changes, we use two reference datasets: New South Wales confirmed Covid case data and Google Trends data on the query term "Covid".
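The two estimates described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual preprocessing. It assumes each calendar row carries a `listing_id` and an `available` flag encoded as "t" or "f"; the function name `summarize_snapshot` and the sample rows are hypothetical.

```python
def summarize_snapshot(rows):
    """Estimate property count and occupancy rate for one scraping snapshot.

    Assumption: "f" (unavailable) is treated as occupied and "t" as vacant,
    ignoring the case where a host has blocked booking.
    """
    listings = {row["listing_id"] for row in rows}
    total = len(rows)
    unavailable = sum(1 for row in rows if row["available"] == "f")
    rate = unavailable / total if total else 0.0
    return {"properties": len(listings), "occupancy_rate": rate}


# Hypothetical snapshot: one row per listing at scraping time.
sample = [
    {"listing_id": 1, "available": "f"},
    {"listing_id": 2, "available": "t"},
    {"listing_id": 3, "available": "f"},
    {"listing_id": 4, "available": "f"},
]
print(summarize_snapshot(sample))  # {'properties': 4, 'occupancy_rate': 0.75}
```

If host-blocked dates turn out to be common, this estimate overstates occupancy, which is the limitation acknowledged in the assumption above.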
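Since the AirBnb snapshots are monthly, one possible way to line the reference data up with them is to aggregate daily case counts into monthly totals. The sketch below assumes the case data can be reduced to a mapping from date to daily confirmed cases; the function name `monthly_totals` and the sample numbers are made up for illustration and are not taken from the NSW dataset.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily_counts):
    """Sum daily counts into (year, month) buckets, sorted chronologically."""
    totals = defaultdict(int)
    for day, count in daily_counts.items():
        totals[(day.year, day.month)] += count
    return dict(sorted(totals.items()))


# Made-up daily confirmed-case counts.
daily = {
    date(2021, 6, 29): 5,
    date(2021, 6, 30): 7,
    date(2021, 7, 1): 10,
}
print(monthly_totals(daily))  # {(2021, 6): 12, (2021, 7): 10}
```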

Load and preprocess listing and calendar data in AirBnb dataset.

Load and preprocess reference data.

Visualize the following in diagrams:

Preliminary analysis of the diagram above

An initial and preliminary analysis of the diagram suggests:

2. Analysis of property number and vacancy rate by neighbourhoods

TBD

3. Analysis of price

TBD

4. Predict price by description and/or review

TBD

Is it achievable?

Below is rough material to be used for the use cases

look into listing.csv and listing.csv.gz

Have a look into number of properties and number of hosts

Get location (point) by name.

look at listing prices

look into reviews.csv

look into neighbourhoods.csv and neighbourhoods.geojson